+++
title = 'Tree models and ensembles'
template = 'page-math.html'
+++
# Tree models & ensembles

## Tree models
### Decision trees (categorical)
Work on numerical and categorical features.

Standard decision tree learning algorithm (ID3, C4.5):
- start with an empty tree
- extend step by step
- stop when: all inputs are the same (no features left to split on), or all outputs are the same (all instances have the same class)
- greedy (no backtracking)
- choose the split that creates the least uniform distribution over class labels in the resulting segmentation
- entropy is a measure of the uniformity of a distribution
- recall, entropy $H(X) = - \sum_{x \in X} P(x) \log{P(x)}$
- conditional entropy: $H(X|Y) = \sum_{y} P(y) H(X | Y = y)$
- information gain of Y: $I_{X}(Y) = H(X) - H(X | Y)$
- so, pick the split with the highest information gain (a small computation sketch follows at the end of this subsection)

The algorithm in steps:
1. start with a single unlabeled leaf
2. loop until there are no unlabeled leaves:
    - for each unlabeled leaf $l$ with segment $S$:
        - if the stop condition holds, label $l$ with the majority class of $S$
            - stop when: all inputs are the same (no features left to split on), or all outputs are the same (all instances have the same class)
        - else split $l$ on the feature $F$ with the highest gain $I_S(F)$

With categorical features, it doesn't make sense to split on the same feature twice.
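Not from the slides, just a minimal sketch of the gain computation above in plain Python, assuming a segment represented as parallel lists of feature values and class labels; the `entropy`/`information_gain` helpers and the toy `outlook`/`windy` data are made-up illustrations.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(X) = -sum_x P(x) log2 P(x), over the class labels in a segment."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """I_X(F) = H(X) - H(X | F): entropy drop from splitting on a categorical feature F."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        segment = [lab for f, lab in zip(feature_values, labels) if f == v]
        cond += (len(segment) / n) * entropy(segment)   # P(F = v) * H(X | F = v)
    return entropy(labels) - cond

# Toy segment: 'outlook' separates the classes perfectly, 'windy' barely helps,
# so the greedy algorithm would split on 'outlook'.
labels  = ['yes', 'yes', 'no', 'no', 'yes', 'no']
outlook = ['sun', 'sun', 'rain', 'rain', 'sun', 'rain']
windy   = ['y',   'n',   'y',   'n',   'y',   'n']
print(information_gain(outlook, labels))  # 1.0 bit (perfect split)
print(information_gain(windy, labels))    # ~0.08 bits
```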
### Regression trees (numeric)
For numeric features, split at a numeric threshold t.

Of course there's a trade-off: complicated decision trees lead to overfitting.

Pruning: for every split, ask whether the tree classifies better with or without the split (on validation data).

Using validation data: the test set is only for final testing, the validation set is for hyperparameter selection. If you want to control the search, split the training data and use part of it for 'validation'.

Label a leaf with its single element, or take the mean of its segment.

Instead of entropy, use $I_{S}(V) = Var(S) - \sum_{i} \frac{| S_i |}{|S|} Var(S_i)$

### Generalization hierarchy
![154f1d15c7808fb3db6bf33d60c61bbb.png](dc5bb005d60e400e9026666b704139da.png)


## Ensembling methods
Bias and variance:

![ba23c24af56d5203ba58cdc8b1b3f8d8.png](2b81999a96204ddeb7dc539b580b8dbb.png)

Real-life example:
- grading by rubric: high bias, low variance
- grading by TA: low bias, high variance

Bootstrapping (from methodology 2):
- sample with replacement a dataset of the same size as the whole dataset
- each bootstrapped sample lets you repeat your experiment
- better than cross-validation for small datasets
- but some classifiers don't like duplicates

Ensembling is:
- used in production to get better performance from a model
- never used in research, since we can improve any model by boosting
- can be expensive for big models

Bagging reduces variance, boosting reduces bias.

### Bagging
Bagging: **b**ootstrap **agg**regating
- resample k datasets, train k models: this is the ensemble
- the ensemble classifies by majority vote
- for class probabilities, use the relative frequencies among the votes

Random forest: bagging with decision trees
- subsample the data and the features for each model in the ensemble
- pro: reduces variance, few hyperparameters, easy to parallelize
- con: no reduction of bias

### Boosting
Train some classifier $m_0$, then iteratively train more classifiers.
Increase the weights of instances misclassified by the current classifier, and train the next iteration on the reweighted data.

Weighted loss function: $\text{loss}(\theta) = \sum_{i} w_{i} (f_{\theta}(x_i) - t_i)^2$

Or resample the data by the weights: the weight determines how likely an instance is to end up in the resampled data.

Boosting works even if the model classifies only slightly better than random.

#### AdaBoost (binary classifiers)
TODO: think of a better way to explain this for the exam

Each model fits a reweighted dataset, and each model defines its own reweighted loss.

#### Gradient boosting
Boosting for regression models.

Fit the next model to the residuals of the current ensemble.

Each model fits the (pseudo) residuals of the previous model; the ensemble optimises a global loss, even if the individual models don't optimise a well-defined loss.

### Stacking
When you want to combine a number of existing models into a single model:
compute the outputs of the models for every point in the dataset, and add them to the dataset as new columns.
Then train a new model ('combiner') on this extended data (usually a logistic regression). If NNs are used for the ensemble, the whole thing turns into a big NN.
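A minimal sketch of the stacking recipe above, assuming scikit-learn and numpy are available; the choice of base models (`DecisionTreeClassifier`, `GaussianNB`), the synthetic dataset, and the `extend` helper are illustrative, not part of the notes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train the base models on the training data.
base_models = [DecisionTreeClassifier(max_depth=3), GaussianNB()]
for m in base_models:
    m.fit(X_train, y_train)

def extend(X_part):
    """Add each base model's predicted class-1 probability as an extra column."""
    cols = [m.predict_proba(X_part)[:, 1] for m in base_models]
    return np.column_stack([X_part] + cols)

# The 'combiner' is a logistic regression trained on the extended data.
combiner = LogisticRegression(max_iter=1000)
combiner.fit(extend(X_train), y_train)
print("stacked accuracy:", combiner.score(extend(X_test), y_test))
```

In practice the base-model predictions for the training set are usually produced out-of-fold (via cross-validation), so the combiner never sees predictions made on points the base models were trained on.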